These are plots and searchable, sortable datatables to accompany the main text. They are divided into unigram, bigram, and trigram sections. The unigram section considers the occurrences of segments by syllabic position. The bigram and trigram sections consider (not necessarily sequential) co-occurrences of two or three segments, respectively.
For some n-grams, data are plotted as histograms or heatmaps. These are hidden by default, but can be revealed (and hidden again) by clicking the appropriate button.
In all tables, the data can be copied to the clipboard or saved as a CSV file. There are also additional columns that are hidden by default; these can be revealed using the Show all button, or individual columns can be selected using the Column visibility button. Columns can also be reordered by clicking and dragging. The datatables can be sorted by individual columns; the default is descending order by O/E ratio.
Most columns in the unigram, bigram, and trigram sections have the same interpretation:
\[\text{O/E} = \frac{\text{Count in SV list}}{\text{Count in both lists}} \times \frac{\text{Length of lexicon}}{\text{Length of SV list}}\]
Since the lengths of the lexicon and the SV list are both constant, O/E is linearly proportional to the proportion of occurrences that fall in the SV layer (here, SV/total \(\times\) 4.234141).
The advantage of the O/E ratio lies in its interpretability: when O/E \(\approx\) 1, the n-gram occurs in the SV list about as often as expected, i.e. about 24% of the time (SV/total \(\approx\) 0.24); when O/E < 1, the n-gram occurs less often than expected; and when O/E > 1, it occurs more often than expected.
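The O/E calculation can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the tables: the function names, the example counts, and the tolerance used to label a value as "about as expected" are all assumptions; only the constant 4.234141 comes from the text.

```python
# Ratio of lexicon length to SV-list length, as quoted in the text.
LEXICON_TO_SV_RATIO = 4.234141


def oe_ratio(sv_count, total_count, ratio=LEXICON_TO_SV_RATIO):
    """Observed/expected ratio for one n-gram.

    sv_count:    occurrences of the n-gram in the SV list
    total_count: occurrences of the n-gram in both lists combined
    """
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return (sv_count / total_count) * ratio


def describe(oe, tolerance=0.05):
    """Verbal label for an O/E value.

    The tolerance is an arbitrary illustrative choice, not from the source.
    """
    if abs(oe - 1.0) <= tolerance:
        return "about as expected"
    if oe > 1.0:
        return "more often than expected"
    return "less often than expected"


# Hypothetical counts: 24 of 100 occurrences fall in the SV list.
example = oe_ratio(24, 100)
print(round(example, 3), describe(example))
```

With these made-up counts, SV/total is 0.24, so the ratio comes out very close to 1, matching the "about 24% of the time" reading above.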
The unigram tables contain two hidden columns %SV and %NSV, which are the list-specific positional frequencies of each segment. They are not true unigram frequencies. For example, in the Onsets table, k has a %SV of 8.72%; this means that 8.72% of syllables in the SV list begin with k. We treat the onset as obligatory, e.g. oan is treated as having the onset ʔ. The %SV and %NSV columns are hidden by default; use the Show all or Column visibility buttons to reveal them.
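As a rough illustration of how a list-specific positional frequency such as %SV could be computed, the sketch below counts onsets over a toy list of syllables. The data and the function name are hypothetical; the only behavior taken from the text is that a syllable without a written onset is counted as beginning with ʔ.

```python
from collections import Counter

GLOTTAL_STOP = "ʔ"


def onset_percentages(onsets):
    """Percentage of syllables in one list that begin with each onset.

    `onsets` holds one onset per syllable; an empty string marks a
    syllable written without an onset, which is counted as ʔ.
    """
    filled = [o if o else GLOTTAL_STOP for o in onsets]
    counts = Counter(filled)
    total = len(filled)
    return {seg: 100.0 * n / total for seg, n in counts.items()}


# Hypothetical toy list of onsets (not the real SV data).
toy_sv_onsets = ["k", "k", "t", "", "m", "k", "", "t"]
pct_sv = onset_percentages(toy_sv_onsets)
print(pct_sv)
```

Because every syllable contributes exactly one onset, the percentages within a list sum to 100.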
The histograms plot each segment’s percentage of the total segments of that type in each layer, ordered by %SV. The histograms are also hidden by default.
The bigram tables contain two columns PMI_SV and PMI_NSV, which give the pointwise mutual information scores for the segment pair in the relevant list. PMI describes the increase or decrease in the cost of describing a segment in a particular environment. Positive PMI for a sequence AB in list L means that when we observe segment A, we are unsurprised to find that segment B occurs after it, whereas negative PMI means that we are more surprised to see B, given that we’ve seen A. PMI is colored green when it exceeds 0.25 and red when it is less than -0.25, but there is nothing inherently special about these values. These columns are hidden by default; use the Show all or Column visibility buttons to reveal them.
Again, all segments are treated as “positionally specific”. That is, final -k and onset k are not the same k for purposes of determining frequencies (and therefore pointwise mutual information). This is partly because what we are interested in is the positional stickiness, and partly because they are arguably different (phonetic) segments.
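The following sketch shows one common way to compute PMI over position-tagged segments. It assumes the standard definition \(\log_2 \frac{p(a,b)}{p(a)\,p(b)}\), with probabilities estimated from counts within a single list; the log base, the estimation method, the tag format, and the toy data are assumptions, and the tables themselves may have been computed differently. Only the ±0.25 coloring thresholds come from the text.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """PMI (base 2) for each (first, second) segment pair in one list.

    Segments are position-tagged strings such as "onset:k" or "coda:k",
    so the same symbol in different positions is counted separately.
    """
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return {
        (a, b): math.log2((c / n) / ((first[a] / n) * (second[b] / n)))
        for (a, b), c in joint.items()
    }


def pmi_color(value, threshold=0.25):
    """Cell color following the text: green above 0.25, red below -0.25."""
    if value > threshold:
        return "green"
    if value < -threshold:
        return "red"
    return None


# Hypothetical toy list of (onset, coda) pairs, one per syllable.
toy_pairs = [
    ("onset:k", "coda:k"),
    ("onset:k", "coda:k"),
    ("onset:k", "coda:n"),
    ("onset:t", "coda:n"),
]
scores = pmi_table(toy_pairs)
for pair, score in scores.items():
    print(pair, round(score, 3), pmi_color(score))
```

Tagging each segment with its position is what keeps onset k and final -k apart in the counts, so their marginal frequencies never mix.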
The heatmaps indicate the number of occurrences of a bigram in a given layer. Hover over a cell in the heatmaps to see the exact count of bigrams for that cell. In the heatmaps only, bigrams with n=1 are not shown.
possible is the count of possible syllables of this shape. What counts as a “possible” syllable? There are different ways to define this; here we assume:
ʔ but excluding w; we distinguish orthographic d gi in addition to s x)
[aː e əː ɛ i ɨ ɔ o u iə ɨə uə] with unrestricted distribution following plain onsets
[a ə] that cannot occur in open syllables
[ɗ t tʰ s z l r c ʂ ɲ ʈ k x ɣ ŋ h ʔ] (we treat onset w here like a labialized ʔ for co-occurrence reasons) which may not be followed by [ɨ ɔ o u ɨə uə] (ostensibly the single exception is quốc but it is typically pronounced [kwək])
[m n ŋ] and 3 unreleased plosive codas [p t k]
[w j] with restricted distribution: [j] cannot follow [i iə e ɛ] and [w] cannot follow [əː ɔ o u uə]
SV and NSV are the counts of syllables of these shapes in the SV and NSV lists, respectively.
%SV and %NSV are the percentages of the possible number of syllables of this shape that occur in the SV or NSV lists, respectively. %possible is simply the sum of %SV and %NSV.
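To make the counting of "possible" combinations concrete, the sketch below enumerates nucleus + glide-coda rimes under the two glide restrictions stated above, and then computes %SV, %NSV and %possible from counts. It is an illustration only: it assumes that both short vowels may combine with both glides, it ignores every other restriction, and the counts passed to `coverage` are made up, so its output is not a figure from the tables.

```python
LONG_VOWELS = ["aː", "e", "əː", "ɛ", "i", "ɨ", "ɔ", "o", "u", "iə", "ɨə", "uə"]
SHORT_VOWELS = ["a", "ə"]

# Nuclei that each glide coda cannot follow, as listed in the text.
BANNED_BEFORE = {
    "j": {"i", "iə", "e", "ɛ"},
    "w": {"əː", "ɔ", "o", "u", "uə"},
}


def possible_glide_rimes(vowels):
    """All (vowel, glide) pairs not excluded by BANNED_BEFORE."""
    return [
        (v, g)
        for g, banned in BANNED_BEFORE.items()
        for v in vowels
        if v not in banned
    ]


def coverage(possible, sv, nsv):
    """Return (%SV, %NSV, %possible) for one syllable shape.

    Following the text, %possible is simply the sum of %SV and %NSV.
    """
    pct_sv = 100.0 * sv / possible
    pct_nsv = 100.0 * nsv / possible
    return pct_sv, pct_nsv, pct_sv + pct_nsv


# Assumption: short vowels combine with both glides.
rimes = possible_glide_rimes(LONG_VOWELS + SHORT_VOWELS)
print(len(rimes))

# Hypothetical counts, purely to show the arithmetic.
print(coverage(possible=200, sv=50, nsv=100))
```

The same pattern (an inventory per position plus a set of banned combinations) extends to the onset and nucleus restrictions, provided the full inventories are supplied.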
Takeaways:
Trần & Vallée 2009 report that “the prevalent monosyllabic pattern in Vietnamese…was the CVC syllable type, respectively 70% and 34% of the monosyllabic words, and respectively 70% and 20% of the language syllable inventory” (2009:232). Their counts were derived from a list of words with frequency above 2% in a 5,000-word lexicon. If we collapse the above table into their three categories (CV, CVC, CCVC), we see the numbers are quite close: about 21% C(C)V, 71% CVC, and 8% CCVC.